Implement iterators over .hic files #2

Merged
merged 50 commits into from
Jun 20, 2023

Conversation

robomics
Contributor

@robomics robomics commented Jun 9, 2023

Rework hic library to support efficient iteration over pixels overlapping a given query.

Move the MatrixType and MatrixUnit parameters from MatrixSelector to the
HiCFile constructor.
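
To illustrate the shape of that API change, here is a minimal sketch using stand-in classes. Everything in the `sketch` namespace is hypothetical and written only for this example: the member names, the default arguments, and the constructor signatures are assumptions, not the library's actual declarations. The point is only that the matrix type and unit are fixed once when the file is opened, and selectors read them from the file instead of taking them as their own parameters.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

namespace sketch {  // stand-ins for illustration, not the real hictk types

enum class MatrixType { observed, oe, expected };
enum class MatrixUnit { BP, FRAG };

// Stand-in for HiCFile: type and unit are chosen once, at construction.
class HiCFile {
  std::string _path;
  std::uint32_t _resolution;
  MatrixType _type;
  MatrixUnit _unit;

 public:
  HiCFile(std::string path, std::uint32_t resolution,
          MatrixType type = MatrixType::observed,  // default is an assumption
          MatrixUnit unit = MatrixUnit::BP)        // default is an assumption
      : _path(std::move(path)), _resolution(resolution), _type(type), _unit(unit) {
    assert(_resolution != 0);
  }

  MatrixType matrix_type() const noexcept { return _type; }
  MatrixUnit matrix_unit() const noexcept { return _unit; }
  std::uint32_t resolution() const noexcept { return _resolution; }
};

// Stand-in for MatrixSelector: it no longer takes type/unit parameters and
// instead inherits them from the file it was created from.
class MatrixSelector {
  const HiCFile* _file;
  std::string _chrom1;
  std::string _chrom2;

 public:
  MatrixSelector(const HiCFile& file, std::string chrom1, std::string chrom2)
      : _file(&file), _chrom1(std::move(chrom1)), _chrom2(std::move(chrom2)) {}

  MatrixType matrix_type() const noexcept { return _file->matrix_type(); }
  MatrixUnit matrix_unit() const noexcept { return _file->matrix_unit(); }
};

}  // namespace sketch
```

With this layout, every selector created from the same file object is guaranteed to agree on type and unit, which is one plausible motivation for the move.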

codecov bot commented Jun 9, 2023

Codecov Report

Merging #2 (404be67) into main (b644236) will decrease coverage by 0.37%.
The diff coverage is 84.56%.

@@            Coverage Diff             @@
##             main       #2      +/-   ##
==========================================
- Coverage   81.45%   81.09%   -0.37%     
==========================================
  Files          47       50       +3     
  Lines        3441     3285     -156     
==========================================
- Hits         2803     2664     -139     
+ Misses        638      621      -17     
Impacted Files                                 Coverage Δ
src/cooler/balancing_impl.hpp                  59.04% <ø> (ø)
src/cooler/dataset_accessors_impl.hpp          71.87% <ø> (ø)
src/cooler/dataset_impl.hpp                    84.81% <ø> (ø)
src/cooler/dataset_iterator_impl.hpp           82.20% <ø> (ø)
src/cooler/file_read_impl.hpp                  75.25% <0.00%> (ø)
src/cooler/file_standard_attr_impl.hpp         54.05% <ø> (ø)
src/cooler/file_validation_impl.hpp            48.64% <ø> (ø)
src/cooler/file_write_impl.hpp                 75.46% <ø> (ø)
src/cooler/include/hictk/cooler.hpp            100.00% <ø> (ø)
src/cooler/include/hictk/cooler/dataset.hpp    100.00% <ø> (ø)
... and 30 more

... and 4 files with indirect coverage changes

Our previous implementation performed quite poorly at high resolutions:
we processed one row at a time and paid the overhead of fetching the
block index and block data for every row.
The intent was to minimize the amount of read-ahead and to yield pixels
in the correct order without an explicit sort.

However, this was too slow.
The current solution is less clean, but performance is much better (15x
faster at 10bp on some of our datasets).
Instead of processing one row at a time, we now process rows in chunks.
Chunk sizes are computed as a fraction of chromosome sizes, and thus
grow linearly with resolution, making the overhead of processing a chunk
comparable across resolutions.
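
A rough sketch of how such chunk sizes could be derived is shown below. It is not the code from this PR: the helper names (`num_bins`, `rows_per_chunk`, `make_row_chunks`) and the `fraction` parameter are made up for illustration, and it assumes that "a fraction of chromosome sizes" means a fixed fraction of the chromosome's bins at the current bin size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: number of bins (matrix rows) covering a chromosome.
inline std::uint64_t num_bins(std::uint64_t chrom_size_bp, std::uint64_t bin_size_bp) {
  assert(bin_size_bp != 0);
  return (chrom_size_bp + bin_size_bp - 1) / bin_size_bp;  // ceiling division
}

// Rows per chunk as a fixed fraction of the chromosome's bins (assumption).
// Smaller bins mean more rows per chunk, so the number of chunks per
// chromosome stays roughly constant across resolutions.
inline std::uint64_t rows_per_chunk(std::uint64_t chrom_size_bp, std::uint64_t bin_size_bp,
                                    double fraction) {
  assert(fraction > 0.0 && fraction <= 1.0);
  const auto bins = num_bins(chrom_size_bp, bin_size_bp);
  const auto rows = static_cast<std::uint64_t>(static_cast<double>(bins) * fraction);
  return std::max<std::uint64_t>(1, rows);  // always process at least one row
}

// Split the rows [0, nbins) into consecutive half-open ranges [first, last).
inline std::vector<std::pair<std::uint64_t, std::uint64_t>> make_row_chunks(
    std::uint64_t nbins, std::uint64_t chunk_rows) {
  assert(chunk_rows != 0);
  std::vector<std::pair<std::uint64_t, std::uint64_t>> chunks;
  for (std::uint64_t first = 0; first < nbins; first += chunk_rows) {
    chunks.emplace_back(first, std::min(first + chunk_rows, nbins));
  }
  return chunks;
}
```

Under these assumptions, a 1 Mbp chromosome with `fraction = 0.25` is split into four chunks both at 1 kbp bins (250 rows each) and at 100 bp bins (2500 rows each), so the per-chunk cost of fetching the block index and data is paid the same number of times at either resolution.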

Another benefit of this approach is that indexing of InteractionBlock is
no longer needed: sorting pixels is enough.
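
The "sorting pixels is enough" idea can be sketched as follows. This is a simplified illustration, not the PR's implementation: the `Pixel` struct, the representation of a block as a plain vector of pixels, and the function name `collect_sorted_pixels` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical pixel record: (row bin, column bin, interaction count).
struct Pixel {
  std::uint64_t bin1_id;
  std::uint64_t bin2_id;
  float count;
};

// Gather the pixels of all blocks that fall in rows [first_row, last_row)
// and sort them by (bin1_id, bin2_id). No per-block index is consulted:
// blocks are scanned linearly and ordering is restored by one sort per chunk.
inline std::vector<Pixel> collect_sorted_pixels(const std::vector<std::vector<Pixel>>& blocks,
                                                std::uint64_t first_row,
                                                std::uint64_t last_row) {
  assert(first_row <= last_row);
  std::vector<Pixel> buffer;
  for (const auto& block : blocks) {
    for (const auto& p : block) {
      if (p.bin1_id >= first_row && p.bin1_id < last_row) {
        buffer.push_back(p);
      }
    }
  }
  std::sort(buffer.begin(), buffer.end(), [](const Pixel& a, const Pixel& b) {
    return std::tie(a.bin1_id, a.bin2_id) < std::tie(b.bin1_id, b.bin2_id);
  });
  return buffer;
}
```

Calling this once per row chunk yields pixels in row-major order within the chunk, and because chunks are visited in increasing row order, the concatenated output is globally sorted as well.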
@robomics robomics changed the title Impl hic lazy fetch Implement iterators over .hic files Jun 20, 2023
@robomics robomics merged commit 43f68a1 into main Jun 20, 2023
@robomics robomics deleted the impl-hic-lazy-fetch branch June 21, 2023 10:23